
    Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures

    Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.
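The similarity-to-dissimilarity step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `coclustering_probability` and `average_linkage`, the toy label samples, and the choice of average linkage (the abstract only says "one of the standard linkage algorithms") are all assumptions made for illustration. The sketch assumes posterior samples of cluster labels are already available, estimates the probability that each pair of genes shares a cluster, and passes one minus that probability to a naive agglomerative clustering.

```python
from itertools import combinations


def coclustering_probability(samples):
    """Fraction of label samples in which each pair of items shares a cluster."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


def average_linkage(dissim, n_clusters):
    """Naive agglomerative clustering; merges the pair of clusters with the
    smallest mean pairwise dissimilarity until n_clusters remain."""
    clusters = [[i] for i in range(len(dissim))]

    def mean_dist(a, b):
        return sum(dissim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: mean_dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters


# Toy data (hypothetical): four genes, four sampled clusterings.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [0, 0, 0, 1],
    [0, 1, 2, 2],
]
prob = coclustering_probability(samples)
dissim = [[1.0 - p for p in row] for row in prob]
clusters = average_linkage(dissim, n_clusters=2)
print(clusters)
```

In practice the label samples would come from an MCMC sampler for the DPM model, and the dissimilarity matrix would more likely be handed to a library routine that also draws a dendrogram.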

    Inferring orthologous gene regulatory networks using interspecies data fusion

    MOTIVATION: The ability to jointly learn gene regulatory networks (GRNs) in, or leverage GRNs between, related species would allow the vast amount of legacy data obtained in model organisms to inform the GRNs of more complex, or economically or medically relevant counterparts. Examples include transferring information from Arabidopsis thaliana into related crop species for food security purposes, or from mice into humans for medical applications. Here we develop two related Bayesian approaches to network inference that allow GRNs to be jointly inferred in, or leveraged between, several related species: in one framework, network information is directly propagated between species; in the second hierarchical approach, network information is propagated via an unobserved 'hypernetwork'. In both frameworks, information about network similarity is captured via graph kernels, with the networks additionally informed by species-specific time series gene expression data, when available, using Gaussian processes to model the dynamics of gene expression. RESULTS: Results on in silico benchmarks demonstrate that joint inference, and leveraging of known networks between species, offers better accuracy than standalone inference. The direct propagation of network information via the non-hierarchical framework is more appropriate when there are relatively few species, while the hierarchical approach is better suited when there are many species. Both methods are robust to small amounts of mislabelling of orthologues. Finally, the use of Saccharomyces cerevisiae data and networks to inform inference of networks in the fission yeast Schizosaccharomyces pombe predicts a novel role in cell cycle regulation for Gas1 (SPAC19B12.02c), a 1,3-beta-glucanosyltransferase.

    CSI : A nonparametric Bayesian approach to network inference from multiple perturbed time series gene expression data

    How an organism responds to the environmental challenges it faces is heavily influenced by its gene regulatory networks (GRNs). Whilst most methods for inferring GRNs from time series mRNA expression data are only able to cope with single time series (or single perturbations with biological replicates), it is becoming increasingly common for several time series to be generated under different experimental conditions. The CSI algorithm (Klemm, 2008) represents one approach to inferring GRNs from multiple time series data, which has previously been shown to perform well on a variety of datasets (Penfold and Wild, 2011). Another challenge in network inference is the identification of condition-specific GRNs, i.e., identifying how a GRN is rewired under different conditions or in different individuals. The Hierarchical Causal Structure Identification (HCSI) algorithm (Penfold et al., 2012) is one approach that allows inference of condition-specific networks (Hickman et al., 2013), and has been shown to be more accurate at reconstructing known networks than inference on the individual datasets alone. Here we describe a MATLAB implementation of CSI/HCSI that includes fast approximate solutions to CSI as well as Markov Chain Monte Carlo implementations of both CSI and HCSI, together with a user-friendly GUI, with the intention of making the analysis of networks from multiple perturbed time series datasets more accessible to the wider community. The GUI itself guides the user through each stage of the analysis, from loading in the data, to parameter selection and visualisation of networks, and can be launched by typing >> csi into the MATLAB command line. For each step of the analysis, links to documentation and tutorials are available within the GUI, which includes documentation on visualisation and interacting with output files.

    Evaluating nuclear protein-coding genes for phylogenetic utility in beetles.

    Although nuclear protein-coding genes have proven broadly useful for phylogenetic inference, relatively few such genes are regularly employed in studies of Coleoptera, the most diverse insect order. We increase the number of loci available for beetle systematics by developing protocols for three genes previously unused in beetles (alpha-spectrin, RNA polymerase II and topoisomerase I) and by refining protocols for five genes already in use (arginine kinase, CAD, enolase, PEPCK and wingless). We evaluate the phylogenetic performance of each gene in a Bayesian framework against a presumably known test phylogeny. The test phylogeny covers 31 beetle specimens and two outgroup taxa of varying age, including three of the four extant beetle suborders and a denser sampling in Adephaga and in the carabid genus Bembidion. All eight genes perform well for Cenozoic divergences and accurately separate closely related species within Bembidion, but individual genes differ markedly in accuracy over the older Mesozoic and Permian divergences. The concatenated data reconstruct the test phylogeny with high support in both Bayesian and parsimony analyses, indicating that combining data from multiple nuclear loci will be a fruitful approach for assembling the beetle tree of life.

    MDI-GPU: accelerating integrative modelling for genomic-scale data using GP-GPU computing.

    The integration of multi-dimensional datasets remains a key challenge in systems biology and genomic medicine. Modern high-throughput technologies generate a broad array of different data types, providing distinct, but often complementary, information. However, the large amount of data adds burden to any inference task. Flexible Bayesian methods may reduce the necessity for strong modelling assumptions, but can also increase the computational burden. We present an improved implementation of a Bayesian correlated clustering algorithm that permits integrated clustering to be routinely performed across multiple datasets, each with tens of thousands of items. By exploiting GPU-based computation, we are able to improve runtime performance of the algorithm by almost four orders of magnitude. This permits analysis across genomic-scale data sets, greatly expanding the range of applications over those originally possible. MDI is available here: http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/

    Discovering transcriptional modules by Bayesian data integration

    Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets. Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.

    Behavioral variation across the days and lives of honey bees

    In honey bee colonies, workers generally change tasks with age (from brood care, to nest work, to foraging). While these trends are well established, our understanding of how individuals distribute tasks during a day, and how individuals differ in their lifetime behavioral trajectories, is limited. Here, we use automated tracking to obtain long-term data on 4,100+ bees tracked continuously at 3 Hz, across an entire summer, and use behavioral metrics to compare behavior at different timescales. Considering single days, we describe how bees differ in space use, detection, and movement. Analyzing the behavior exhibited across their entire lives, we find consistent inter-individual differences in the movement characteristics of individuals. Bees also differ in how quickly they transition through behavioral space to ultimately become foragers, with fast-transitioning bees living the shortest lives. Our analysis framework provides a quantitative approach to describe individual behavioral variation within a colony from single days to entire lifetimes.

    ParticleMDI - Particle Monte Carlo methods for the cluster analysis of multiple datasets with applications to cancer subtype identification

    We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to improving the computational performance of our algorithm. These methods can achieve greater than an order-of-magnitude improvement in performance at no cost to accuracy and can be applied more broadly to Bayesian inference for mixture models with a single dataset. We apply our algorithm to the discovery of risk cohorts amongst 243 patients presenting with kidney renal clear cell carcinoma, using samples from the Cancer Genome Atlas, for which there are gene expression, copy number variation, DNA methylation, protein expression and microRNA data. We identify 4 distinct consensus subtypes and show they are prognostic for survival rate (p < 0.0001).

    Far Infrared and Submillimeter Emission from Galactic and Extragalactic Photo-Dissociation Regions

    Photodissociation Region (PDR) models are computed over a wide range of physical conditions, from those appropriate to giant molecular clouds illuminated by the interstellar radiation field to the conditions experienced by circumstellar disks very close to hot massive stars. These models use the most up-to-date values of atomic and molecular data, the most current chemical rate coefficients, and the newest grain photoelectric heating rates which include treatments of small grains and large molecules. In addition, we examine the effects of metallicity and cloud extinction on the predicted line intensities. Results are presented for PDR models with densities over the range n=10^1-10^7 cm^-3 and for incident far-ultraviolet radiation fields over the range G_0=10^-0.5-10^6.5, for metallicities Z=1 and 0.1 times the local Galactic value, and for a range of PDR cloud sizes. We present line strength and/or line ratio plots for a variety of useful PDR diagnostics: [C II] 158 micron, [O I] 63 and 145 micron, [C I] 370 and 609 micron, CO J=1-0, J=2-1, J=3-2, J=6-5 and J=15-14, as well as the strength of the far-infrared continuum. These plots will be useful for the interpretation of Galactic and extragalactic far infrared and submillimeter spectra observable with ISO, SOFIA, SWAS, FIRST and other orbital and suborbital platforms. As examples, we apply our results to ISO and ground based observations of M82, NGC 278, and the Large Magellanic Cloud. Comment: 54 pages, 20 figures, accepted for publication in The Astrophysical Journal